# Dr. M. Baron, Statistical Machine Learning class, STAT-427/627

 

# Polynomial Regression

 

 

# Example: QUADRATIC MODEL FOR PREDICTING THE US POPULATION

 

> setwd("C:\\Users\\baron\\Documents\\Teach\\627 Statistical Machine Learning\\Data")

 

> Data = read.csv("USpop.csv")

> names(Data)

[1] "Year"       "Population"

 

> attach(Data)

> plot(Year,Population)

 

 

# LINEAR MODEL

 

> lin = lm(Population ~ Year)

> summary(lin)

 

Call:

lm(formula = Population ~ Year)

 

Coefficients:

              Estimate Std. Error t value Pr(>|t|)   

(Intercept) -2.600e+03  1.691e+02  -15.38 2.98e-13 ***

Year         1.425e+00  8.871e-02   16.06 1.24e-13 ***

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

Residual standard error: 30.08 on 22 degrees of freedom

Multiple R-squared:  0.9214,    Adjusted R-squared:  0.9178

F-statistic: 257.9 on 1 and 22 DF,  p-value: 1.235e-13

> abline(lin,col="red",lwd=3)

 

# Clearly, the linear model is too inflexible and restrictive; it does not provide a good fit.

# This is underfitting. Notice, however, that R-squared in this regression is 0.9214. Without

# looking at the plot, we could have concluded that the model is very good!
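
# Both flavors of R-squared can also be extracted from the summary object

# directly, for example:

> summary(lin)$r.squared        # 0.9214

> summary(lin)$adj.r.squared    # 0.9178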

 

> predict(lin,data.frame(Year=2030))

       1

291.5174

 

# This is obviously a poor prediction. The US population was already 331 million at the most

# recent Census. So, we have probably omitted an important term. Residual plots will help us

# determine which one.
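
# To make the bias concrete, predict the linear model at the 2020 Census year

# (by the printed slope of 1.425 per year, this comes to roughly 277 million,

# about 54 million short of the actual count):

> predict(lin,data.frame(Year=2020))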

# Let’s produce various related plots. Partition the graphics window into 4 parts and use “plot”.

 

> par(mfrow=c(2,2))

> plot(lin)
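
# plot() applied to an lm object produces four diagnostic plots: Residuals vs

# Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage.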

 

# The first plot (Residuals vs Fitted) shows a clear curved pattern: a quadratic

# term has been omitted, although it is important for population growth. So, fit

# a quadratic model. The function I(...) means "inhibit interpretation"; it forces R

# to treat Year^2 literally, as Year squared, rather than as a formula operator.

 

> quadr = lm(Population ~ Year + I(Year^2))

> summary(quadr)

 

Coefficients:

              Estimate Std. Error t value Pr(>|t|)   

(Intercept)  2.170e+04  5.227e+02   41.52   <2e-16 ***

Year        -2.412e+01  5.493e-01  -43.91   <2e-16 ***

I(Year^2)    6.705e-03  1.442e-04   46.51   <2e-16 ***

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

Residual standard error: 3.019 on 21 degrees of freedom

Multiple R-squared:  0.9992,    Adjusted R-squared:  0.9992

F-statistic: 1.389e+04 on 2 and 21 DF,  p-value: < 2.2e-16
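
# A side note: the same quadratic fit can be written with poly(). With raw = TRUE

# it reproduces the Year + I(Year^2) coefficients; the default orthogonal basis

# changes the coefficients but not the fitted curve.

> quadr.raw = lm(Population ~ poly(Year, 2, raw = TRUE))

> quadr.orth = lm(Population ~ poly(Year, 2))

> all.equal(fitted(quadr.raw), fitted(quadr.orth))    # same fitted values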

 

# A higher R-squared is not surprising: it always increases when we add

# new variables to the model. The fair criterion for comparing models with

# different numbers of parameters is Adjusted R-squared. The quadratic model has

# Adjusted R-squared = 0.9992, compared with 0.9178 for the linear model.
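
# The nested models can also be compared formally with a partial F-test, e.g.:

> anova(lin, quadr)    # tests whether I(Year^2) significantly improves the fit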

# Now let’s obtain the fitted values and plot the fitted curve.

 

> Yhat = fitted.values(quadr)

 

> par(mfrow=c(1,1))    # reset the window split by par(mfrow=c(2,2)) above

> plot(Year,Population)

> abline(lin,col="red",lwd=3)

> lines(Year,Yhat,col="blue",lwd=3)
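
# A legend helps tell the two fits apart, for example:

> legend("topleft", legend=c("linear fit","quadratic fit"), col=c("red","blue"), lwd=3)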

 

 

# The quadratic curve fits nearly perfectly. The quadratic growth slowed down only during

# World War II, although the Baby Boom quickly brought the population back to the trend.

 

> predict(quadr,data.frame(Year=2030))

       1

364.1572

 

# Now, this is a reasonable prediction for the year 2030.

 

> predict(quadr,data.frame(Year=2030),interval="confidence")

       fit      lwr      upr

1 364.1572 359.9686 368.3459

 

> predict(quadr,data.frame(Year=2030),interval="prediction")

       fit      lwr      upr

1 364.1572 356.6102 371.7042
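
# The confidence interval bounds the mean population at Year = 2030, while the

# prediction interval also accounts for the variance of a single new observation,

# which is why it is wider.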

 

# Food for thought... Are the confidence and prediction intervals valid here?
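
# These intervals rely on independent, homoscedastic, normal errors. With a

# decennial time series, serial correlation is the prime suspect; a quick check

# is to look at the autocorrelations of the residuals:

> acf(residuals(quadr))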